EDA for SEPTA Bus Ridership

Authors

Angel Rutherford

Ixchel Ramirez

Tess Vu

Published

November 28, 2025

Potential Project?

Research Question: Which census tracts show structural, long-term increases in bus demand, signaling that Philadelphia will need permanent service expansion?

Method: Spatio-temporal panel analysis using Negative Binomial regression with temporal lags, spatial lags, and tract-level fixed effects.

Data Scope: 2014-2023 SEPTA bus ridership across census tracts in the five-county Philadelphia region. The data are unfortunately restricted to the fall season, but this is one of the restrictions Dr. Delmelle mentioned addressing in the project, so it's fine if the coverage isn't thorough. She's really looking for our understanding of what we're doing and the limitations we face with real-world data! :D

Key Stuff:

  • Ridership exhibits extreme overdispersion (variance > mean), justifying Negative Binomial model.
  • Strong temporal autocorrelation indicates lagged predictors could be critical.
  • Spatial clustering indicates need for spatial lag variables.
  • Weekday recovery differs substantially from weekend patterns.
  • County-level heterogeneity suggests fixed effects are required.
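Taken together, these points imply a Negative Binomial panel specification with lags and fixed effects. As a minimal runnable sketch of that model family, fit on simulated toy data since the real panel is not built yet (all variable names below are toy names, not the project's columns):

```r
library(MASS)  # glm.nb: Negative Binomial regression

set.seed(42)
# Toy panel: 5 tracts observed over 8 years, with overdispersed counts.
toy <- expand.grid(tract = factor(1:5), year = 1:8)
toy$mu <- exp(3 + 0.1 * toy$year + 0.3 * as.numeric(toy$tract))
toy$on <- rnbinom(nrow(toy), mu = toy$mu, size = 2)  # size = dispersion theta

# NB regression with a year trend and tract fixed effects, mirroring the
# planned specification; temporal and spatial lags would enter as extra terms.
fit <- glm.nb(on ~ year + tract, data = toy)
fit$theta  # estimated dispersion parameter
```

On the real data, `factor(Census_Tract_ID)` would play the role of `tract`, though with ~850 tracts a fixed-effects NB estimator may be preferable to dummy variables.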

Setup and Initial Data Loading

Code
# Load libraries.
library(tidyverse)
library(lubridate)
library(sf)
library(scales)
library(patchwork)
library(viridis)
library(kableExtra)
library(corrplot)
library(ggridges)
library(knitr)

# Disable scientific notation.
options(scipen = 999)

# Consistent theme.
theme_set(theme_minimal(base_size = 12))

# Save figures to disk when knitting.
opts_chunk$set(
  fig.path = "figures/", 
  dev = "png"
)
Code
# Core panel data for spatio-temporal analysis.
df_raw <- read.csv("data/septa/Bus_Ridership_by_Census_Tract.csv")

# Initial data structure and dimensions.
cat("Initial dataset dimensions:\n")
Initial dataset dimensions:
Code
print(dim(df_raw))
[1] 17891     9
Code
# First few rows to understand structure.
head(df_raw) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
County_Name Census_Tract_ID Census_Tract_Name Mode Season Day_of_Week On_ Off_ ObjectId
Bucks County 42017100102 Census Tract 1001.02 Bus Fall 2014 Weekday 42 53 1
Bucks County 42017100103 Census Tract 1001.03 Bus Fall 2014 Weekday 79 73 2
Bucks County 42017100104 Census Tract 1001.04 Bus Fall 2014 Weekday 88 69 3
Bucks County 42017100105 Census Tract 1001.05 Bus Fall 2014 Weekday 4 3 4
Bucks County 42017100201 Census Tract 1002.01 Bus Fall 2014 Weekday 484 350 5
Bucks County 42017100206 Census Tract 1002.06 Bus Fall 2014 Weekday 903 956 6
Code
# Column types.
str(df_raw)
'data.frame':   17891 obs. of  9 variables:
 $ County_Name      : chr  "Bucks County" "Bucks County" "Bucks County" "Bucks County" ...
 $ Census_Tract_ID  : num  42017100102 42017100103 42017100104 42017100105 42017100201 ...
 $ Census_Tract_Name: chr  "Census Tract 1001.02" "Census Tract 1001.03" "Census Tract 1001.04" "Census Tract 1001.05" ...
 $ Mode             : chr  "Bus" "Bus" "Bus" "Bus" ...
 $ Season           : chr  "Fall 2014" "Fall 2014" "Fall 2014" "Fall 2014" ...
 $ Day_of_Week      : chr  "Weekday" "Weekday" "Weekday" "Weekday" ...
 $ On_              : int  42 79 88 4 484 903 391 305 60 6 ...
 $ Off_             : int  53 73 69 3 350 956 549 455 88 6 ...
 $ ObjectId         : int  1 2 3 4 5 6 7 8 9 10 ...
Code
# Missing values by column.
colSums(is.na(df_raw)) %>%
  as.data.frame() %>%
  rename("Missing_Count" = ".") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Missing_Count
County_Name 0
Census_Tract_ID 0
Census_Tract_Name 0
Mode 0
Season 0
Day_of_Week 0
On_ 0
Off_ 0
ObjectId 0

Data Cleaning and Feature Engineering

Code
# Cleaned dataset with temporal and categorical features.
df_clean <- df_raw %>%
  # Numeric year from Season column for time series analysis.
  mutate(Year = as.numeric(str_extract(Season, "\\d{4}"))) %>%
  # Remove rows with missing ridership counts that mess up aggregations.
  filter(!is.na(On_), !is.na(Off_)) %>%
  # Binary indicator for weekday vs. weekend patterns.
  mutate(
    Is_Weekday = ifelse(Day_of_Week == "Weekday", 1, 0),
    # Note: the raw data labels days as Weekday, Saturday, or Sunday.
    Day_Type = factor(Day_of_Week, levels = c("Weekday", "Saturday", "Sunday"))
  ) %>%
  # Pandemic period indicator for break analysis.
  mutate(
    Period = case_when(
      Year < 2020 ~ "Pre-Pandemic",
      Year == 2020 ~ "Pandemic",
      Year > 2020 ~ "Post-Pandemic"
    ),
    Period = factor(Period, levels = c("Pre-Pandemic", "Pandemic", "Post-Pandemic"))
  ) %>%
  # Convert Census Tract ID to character for joining.
  mutate(Census_Tract_ID = as.character(Census_Tract_ID))

# Summary stats of cleaned data showing key distributions.
cat("Cleaned dataset dimensions:\n")
Cleaned dataset dimensions:
Code
print(dim(df_clean))
[1] 17891    13
Code
summary(df_clean) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
County_Name Census_Tract_ID Census_Tract_Name Mode Season Day_of_Week On_ Off_ ObjectId Year Is_Weekday Day_Type Period
Length:17891 Length:17891 Length:17891 Length:17891 Length:17891 Length:17891 Min. : 0.0 Min. : 0.0 Min. : 1 Min. :2014 Min. :0.0000 Weekday:7668 Pre-Pandemic :10224
Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 16.0 1st Qu.: 17.0 1st Qu.: 4474 1st Qu.:2017 1st Qu.:0.0000 Weekend: 0 Pandemic : 0
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 92.0 Median : 97.0 Median : 8946 Median :2019 Median :0.0000 Sunday :5112 Post-Pandemic: 7667
NA NA NA NA NA NA Mean : 351.3 Mean : 351.6 Mean : 8946 Mean :2019 Mean :0.4286 NA's :5111 NA
NA NA NA NA NA NA 3rd Qu.: 353.0 3rd Qu.: 357.0 3rd Qu.:13418 3rd Qu.:2022 3rd Qu.:1.0000 NA NA
NA NA NA NA NA NA Max. :21739.0 Max. :22339.0 Max. :17891 Max. :2023 Max. :1.0000 NA NA
Code
# Check temporal coverage across all years in panel.
df_clean %>%
  count(Year, name = "Observations") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Year Observations
2014 852
2015 852
2016 852
2017 2556
2018 2556
2019 2556
2021 2556
2022 2555
2023 2556
Code
# Check zero-ridership tracts which may indicate service gaps.
zero_ridership <- df_clean %>%
  filter(On_ == 0 & Off_ == 0) %>%
  count(Year, name = "Zero_Ridership_Tracts")

zero_ridership %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Year Zero_Ridership_Tracts
2014 54
2015 38
2016 42
2017 246
2018 226
2019 197
2021 246
2022 166
2023 170
Code
# Percentage of zero observations per year.
total_obs <- df_clean %>%
  count(Year, name = "Total")

zero_pct <- zero_ridership %>%
  left_join(total_obs, by = "Year") %>%
  mutate(Percent_Zero = round(Zero_Ridership_Tracts / Total * 100, 2))

zero_pct %>%
  select(Year, Percent_Zero) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Year Percent_Zero
2014 6.34
2015 4.46
2016 4.93
2017 9.62
2018 8.84
2019 7.71
2021 9.62
2022 6.50
2023 6.65
Code
# Geographic coverage across five-county SEPTA service area.
df_clean %>%
  count(County_Name, name = "Tract_Observations") %>%
  arrange(desc(Tract_Observations)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
County_Name Tract_Observations
Philadelphia County 8442
Montgomery County 3549
Delaware County 2961
Bucks County 1658
Chester County 1197
Mercer County 42
New Castle County 42
Code
# Unique tracts per county to understand spatial extent.
df_clean %>%
  group_by(County_Name) %>%
  summarize(Unique_Tracts = n_distinct(Census_Tract_ID)) %>%
  arrange(desc(Unique_Tracts)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
County_Name Unique_Tracts
Philadelphia County 402
Montgomery County 169
Delaware County 141
Bucks County 79
Chester County 57
Mercer County 2
New Castle County 2

Distribution Analysis

Ridership Distribution Histograms

Code
# Pivot On_ and Off_ counts into long format for facet.
df_long <- df_clean %>%
  select(Census_Tract_ID, Year, Day_of_Week, On_, Off_) %>%
  pivot_longer(
    cols = c(On_, Off_),
    names_to = "Ridership_Type_Raw",
    values_to = "Count"
  ) %>%
  mutate(
    Ridership_Type = case_when(
      Ridership_Type_Raw == "On_" ~ "Boardings (On)",
      Ridership_Type_Raw == "Off_" ~ "Deboardings (Off)"
    )
  )

# Log transformation column for distribution comparison.
df_hist_data <- df_long %>%
  mutate(
    Log_Count = log10(Count + 1),
    Scale = factor("Raw Count", levels = c("Raw Count", "log(Count + 1)"))
  )

# Duplicate data with log scale for second column facet.
df_log_data <- df_hist_data %>%
  mutate(
    Count = Log_Count,
    Scale = factor("log(Count + 1)", levels = c("Raw Count", "log(Count + 1)"))
  )

# Combine raw and log-transformed data for viz.
df_final_hist <- bind_rows(df_hist_data, df_log_data) %>%
  filter(!is.na(Count))

# 2x2 faceted histogram showing raw and log-transformed distributions.
dist_plot <- ggplot(df_final_hist, aes(x = Count, fill = Ridership_Type)) +
  geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("#2E86AB", "#A23B72")) +
  labs(
    title = "Ridership Distribution: Raw vs. Log-Transformed Counts",
    subtitle = "Right-Skew and Overdispersion = Negative Binomial > Poisson",
    x = "Ridership Count",
    y = "Frequency (Number of Observations)",
    fill = "Type",
    caption = "Data: SEPTA Bus Ridership 2014-2023 | All Census Tracts"
  ) +
  facet_grid(Ridership_Type ~ Scale, scales = "free") +
  theme_minimal(base_size = 11) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(face = "bold")
  )

# Distribution plot demonstrating need for count model.
dist_plot

Overdispersion Statistics

Code
# Mean and variance for boardings to quantify overdispersion.
overdispersion_stats <- df_clean %>%
  summarize(
    Mean_On = mean(On_, na.rm = TRUE),
    Variance_On = var(On_, na.rm = TRUE),
    Dispersion_Ratio = Variance_On / Mean_On,
    Mean_Off = mean(Off_, na.rm = TRUE),
    Variance_Off = var(Off_, na.rm = TRUE),
    Dispersion_Ratio_Off = Variance_Off / Mean_Off
  )

cat(sprintf("Boardings (On_):\n"))
Boardings (On_):
Code
cat(sprintf("  Mean: %.2f\n", overdispersion_stats$Mean_On))
  Mean: 351.28
Code
cat(sprintf("  Variance: %.2f\n", overdispersion_stats$Variance_On))
  Variance: 831063.87
Code
cat(sprintf("  Dispersion Ratio: %.2f\n", overdispersion_stats$Dispersion_Ratio))
  Dispersion Ratio: 2365.82
Code
cat(sprintf("\nDeboardings (Off_):\n"))

Deboardings (Off_):
Code
cat(sprintf("  Mean: %.2f\n", overdispersion_stats$Mean_Off))
  Mean: 351.61
Code
cat(sprintf("  Variance: %.2f\n", overdispersion_stats$Variance_Off))
  Variance: 758082.12
Code
cat(sprintf("  Dispersion Ratio: %.2f\n\n", overdispersion_stats$Dispersion_Ratio_Off))
  Dispersion Ratio: 2156.05

The dispersion ratios above (variance / mean) far exceed 1 for both boardings and deboardings, confirming severe overdispersion. A Negative Binomial model accommodates this variance structure; a Poisson model, which assumes the variance equals the mean, does not.
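Overdispersion can also be checked formally with the Pearson dispersion statistic from a Poisson fit, which should be near 1 if Poisson were adequate. A self-contained sketch on simulated counts of roughly this scale (substitute `df_clean$On_` for `y` to run it on the real column):

```r
set.seed(1)
# Simulated heavy-tailed counts with mean ~350, mimicking the On_ column.
y <- rnbinom(5000, mu = 350, size = 0.15)

# Intercept-only Poisson fit; its Pearson statistic estimates dispersion.
pois_fit <- glm(y ~ 1, family = poisson)
pearson_disp <- sum(residuals(pois_fit, type = "pearson")^2) / pois_fit$df.residual
pearson_disp  # values far above 1 indicate overdispersion
```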

Quantile Analysis

Code
# Quantiles to understand concentration of ridership.
quantile_stats <- df_clean %>%
  summarize(
    Min = min(On_),
    Q01 = quantile(On_, 0.01),
    Q05 = quantile(On_, 0.05),
    Q25 = quantile(On_, 0.25),
    Median = median(On_),
    Q75 = quantile(On_, 0.75),
    Q95 = quantile(On_, 0.95),
    Q99 = quantile(On_, 0.99),
    Max = max(On_)
  )

# Boarding count distribution quantiles.
quantile_stats %>%
  pivot_longer(everything(), names_to = "Quantile", values_to = "Boardings") %>%
  kable(digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Quantile Boardings
Min 0.0
Q01 0.0
Q05 0.0
Q25 16.0
Median 92.0
Q75 353.0
Q95 1464.5
Q99 3268.1
Max 21739.0

Temporal Analysis

System-Wide Annual Trend

Code
# Total boardings per year for macro trend.
annual_trend <- df_clean %>%
  group_by(Year) %>%
  summarize(
    Total_On = sum(On_),
    Mean_On = mean(On_),
    Median_On = median(On_),
    .groups = "drop"
  )

# Line plot showing total ridership with pandemic marker.
annual_plot <- ggplot(annual_trend, aes(x = Year, y = Total_On)) +
  geom_line(color = "#1F77B4", linewidth = 1.5) +
  geom_point(color = "#1F77B4", size = 4) +
  geom_vline(xintercept = 2019.5, linetype = "dashed", color = "#D62728", linewidth = 1) +
  annotate("text", x = 2019.5, y = max(annual_trend$Total_On) * 0.95,
           label = "Pre-Pandemic\nBaseline", hjust = 1.1, color = "#D62728", size = 4) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Total System Ridership: 2014-2023",
    subtitle = "Sharp decline by 2021 (Fall 2020 unobserved) followed by partial recovery through 2023.",
    x = "Year (Fall Season)",
    y = "Total Average Daily Boardings",
    caption = "Data: SEPTA Bus Ridership | Red Line = Fall 2019 Baseline"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.title = element_text(face = "bold")
  )

# Trend showing pandemic impact.
annual_plot

Code
# Recovery stats.
recovery_pct <- annual_trend %>%
  filter(Year %in% c(2019, 2023)) %>%
  summarize(
    Ridership_2019 = Total_On[Year == 2019],
    Ridership_2023 = Total_On[Year == 2023],
    Recovery_Pct = (Ridership_2023 / Ridership_2019) * 100
  )

cat(sprintf("2019 Total Boardings: %s\n", comma(recovery_pct$Ridership_2019)))
2019 Total Boardings: 922,207
Code
cat(sprintf("2023 Total Boardings: %s\n", comma(recovery_pct$Ridership_2023)))
2023 Total Boardings: 701,079
Code
cat(sprintf("Recovery Rate: %.1f%%\n", recovery_pct$Recovery_Pct))
Recovery Rate: 76.0%

Weekday vs. Weekend Recovery Comparison

Code
# Total boardings by year and day type for differential recovery.
day_type_trend <- df_clean %>%
  group_by(Year, Day_of_Week) %>%
  summarize(Total_On = sum(On_), .groups = "drop")

# Line plot showing divergent recovery patterns by day type.
day_type_plot <- ggplot(day_type_trend, aes(x = Year, y = Total_On, color = Day_of_Week)) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 3.5) +
  geom_vline(xintercept = 2019.5, linetype = "dashed", color = "gray50", alpha = 0.7) +
  scale_color_manual(values = c("Weekday" = "#2E86AB", "Saturday" = "#A23B72", "Sunday" = "#F18F01")) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Divergent Recovery: Weekday vs. Weekend Ridership",
    subtitle = "Weekday ridership shows stronger recovery indicating return of commuting patterns.",
    x = "Year (Fall Season)",
    y = "Total Average Daily Boardings",
    color = "Day Type",
    caption = "Data: SEPTA Bus Ridership 2014-2023"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Day type comparison showing differential recovery.
day_type_plot

Year-over-Year Growth Rates

Code
# Year-over-year percentage change in ridership.
yoy_growth <- annual_trend %>%
  arrange(Year) %>%
  mutate(
    YoY_Change = (Total_On - lag(Total_On)) / lag(Total_On) * 100,
    Growth_Type = ifelse(YoY_Change >= 0, "Growth", "Decline")
  ) %>%
  filter(!is.na(YoY_Change))

# Bar chart showing year-over-year growth rates.
yoy_plot <- ggplot(yoy_growth, aes(x = factor(Year), y = YoY_Change, fill = Growth_Type)) +
  geom_col(width = 0.7) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = 0.5) +
  scale_fill_manual(values = c("Growth" = "#27AE60", "Decline" = "#E74C3C")) +
  labs(
    title = "Year-Over-Year Ridership Growth Rates",
    subtitle = "Massive decline by 2021 (no Fall 2020 data), then gradual recovery in subsequent years.",
    x = "Year",
    y = "Year-Over-Year Change (%)",
    fill = "Change Type",
    caption = "Data: SEPTA Bus Ridership"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display year-over-year growth chart.
yoy_plot

Tract Analysis

County-Level Recovery Ratios

Code
# Average boardings and recovery ratio for each county.
county_recovery <- df_clean %>%
  filter(Year %in% c(2019, 2023), Day_of_Week == "Weekday") %>%
  group_by(County_Name, Year) %>%
  summarize(
    Total_On = sum(On_),
    Mean_On = mean(On_),
    Median_On = median(On_),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = Year,
    values_from = c(Total_On, Mean_On, Median_On),
    names_sep = "_"
  ) %>%
  mutate(
    Recovery_Ratio = Total_On_2023 / Total_On_2019,
    Mean_Recovery = Mean_On_2023 / Mean_On_2019,
    Pct_Change = (Recovery_Ratio - 1) * 100
  ) %>%
  arrange(desc(Recovery_Ratio))

# County-level recovery analysis (2023 vs. 2019).
county_recovery %>%
  select(County_Name, Total_On_2019, Total_On_2023, Recovery_Ratio, Pct_Change) %>%
  kable(
    digits = 2,
    col.names = c("County", "2019 Total", "2023 Total", "Recovery Ratio", "% Change")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
County 2019 Total 2023 Total Recovery Ratio % Change
Chester County 2088 2099 1.01 0.53
Mercer County 54 47 0.87 -12.96
Philadelphia County 436109 328513 0.75 -24.67
Delaware County 39958 28730 0.72 -28.10
Bucks County 4897 3294 0.67 -32.73
New Castle County 195 130 0.67 -33.33
Montgomery County 22767 15154 0.67 -33.44

High-Ridership Tract Concentration

Code
# Top 10 percent of tracts by ridership for spatial concentration.
threshold_90pct <- quantile(df_clean$On_, 0.90)

# Concentration statistics showing inequality.
concentration_stats <- df_clean %>%
  filter(Year == 2023, Day_of_Week == "Weekday") %>%
  mutate(High_Ridership = On_ >= threshold_90pct) %>%
  group_by(High_Ridership) %>%
  summarize(
    Tract_Count = n(),
    Total_Boardings = sum(On_),
    .groups = "drop"
  ) %>%
  mutate(
    Pct_Tracts = Tract_Count / sum(Tract_Count) * 100,
    Pct_Ridership = Total_Boardings / sum(Total_Boardings) * 100
  )

# Ridership concentration for top 10% of tracts in 2023.
concentration_stats %>%
  kable(digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
High_Ridership Tract_Count Total_Boardings Pct_Tracts Pct_Ridership
FALSE 725 142677 85.1 37.7
TRUE 127 235290 14.9 62.3
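The top-decile share in the table can be computed directly from any vector of boardings; a sketch on simulated skewed counts (replace `boardings` with the 2023 weekday `On_` values for the real figure):

```r
set.seed(7)
# Simulated right-skewed boardings for 1000 tracts.
boardings <- rnbinom(1000, mu = 350, size = 0.15)

# Share of total ridership captured by the top 10% of tracts.
sorted <- sort(boardings, decreasing = TRUE)
top10_share <- sum(sorted[1:100]) / sum(sorted)
top10_share
```

Extending `sorted` into a full cumulative-share curve would give the Lorenz curve and a Gini coefficient for ridership concentration.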
Code
# Ridge plot showing distribution evolution over time.
ridge_data <- df_clean %>%
  filter(Day_of_Week == "Weekday", On_ > 0) %>%
  mutate(Year_Factor = factor(Year))

ridge_plot <- ggplot(ridge_data, aes(x = log10(On_ + 1), y = Year_Factor, fill = Year_Factor)) +
  geom_density_ridges(alpha = 0.7, scale = 2) +
  scale_fill_viridis_d(option = "turbo") +
  labs(
    title = "Ridership Distribution Across Census Tracts",
    subtitle = "Log-transformed boarding counts show temporal shifts in spatial distribution.",
    x = "log(Boardings + 1)",
    y = "Year",
    fill = "Year",
    caption = "Data: SEPTA Weekday Bus Ridership"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "none"
  )

ridge_plot

Tract-Level Recovery Heterogeneity

Code
# Tract-level recovery ratios comparing 2023 to 2019.
tract_recovery <- df_clean %>%
  filter(Year %in% c(2019, 2023), Day_of_Week == "Weekday") %>%
  group_by(Census_Tract_ID, County_Name, Year) %>%
  summarize(Total_On = sum(On_), .groups = "drop") %>%
  pivot_wider(names_from = Year, values_from = Total_On, names_prefix = "Y") %>%
  filter(Y2019 > 0) %>%
  mutate(Recovery_Ratio = Y2023 / Y2019) %>%
  filter(!is.infinite(Recovery_Ratio))

# Histogram showing distribution of recovery ratios.
recovery_hist <- ggplot(tract_recovery, aes(x = Recovery_Ratio)) +
  geom_histogram(bins = 50, fill = "#3498DB", color = "white", alpha = 0.8) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "red", linewidth = 1) +
  annotate("text", x = 1, y = Inf, label = "Full Recovery", 
           vjust = 2, hjust = 1.1, color = "red", fontface = "bold") +
  labs(
    title = "Distribution of Tract-Level Recovery Ratios (2023 vs. 2019)",
    subtitle = "Wide variation shows some tracts exceeded pre-pandemic ridership while others lag.",
    x = "Recovery Ratio (2023 / 2019)",
    y = "Number of Census Tracts",
    caption = "Data: SEPTA Weekday Bus Ridership | Ratio > 1 = Exceeds 2019 Baseline"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )

recovery_hist

Code
# Summary stats for recovery distribution.
summary(tract_recovery$Recovery_Ratio) %>%
  as.matrix() %>%
  kable(col.names = "Value", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Value
Min. 0.000
1st Qu. 0.642
Median 0.745
Mean 0.819
3rd Qu. 0.857
Max. 19.000

Top Growth vs. Decline Tracts

Code
# Top 20 tracts with highest growth and steepest decline.
top_gainers <- tract_recovery %>%
  filter(Y2019 >= 100) %>%
  arrange(desc(Recovery_Ratio)) %>%
  head(20) %>%
  mutate(Type = "Top Gainers")

top_decliners <- tract_recovery %>%
  filter(Y2019 >= 100) %>%
  arrange(Recovery_Ratio) %>%
  head(20) %>%
  mutate(Type = "Steepest Declines")

# Combine and viz winners vs. losers.
winners_losers <- bind_rows(top_gainers, top_decliners)

# Top 10 highest growth tracts in 2023 vs. 2019, minimum 100 boardings in 2019.
top_gainers %>%
  head(10) %>%
  select(Census_Tract_ID, County_Name, Y2019, Y2023, Recovery_Ratio) %>%
  kable(digits = 2, col.names = c("Tract ID", "County", "2019", "2023", "Recovery Ratio")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Tract ID County 2019 2023 Recovery Ratio
42101000102 Philadelphia County 325 627 1.93
42101018801 Philadelphia County 736 1360 1.85
42101021900 Philadelphia County 101 164 1.62
42101034501 Philadelphia County 104 157 1.51
42101015300 Philadelphia County 326 457 1.40
42101021500 Philadelphia County 213 284 1.33
42029302500 Chester County 145 188 1.30
42101034502 Philadelphia County 837 1047 1.25
42101017300 Philadelphia County 998 1213 1.22
42101036000 Philadelphia County 741 869 1.17
Code
# Top 10 steepest decline tracts in 2023 vs. 2019, minimum 100 boardings in 2019.
top_decliners %>%
  head(10) %>%
  select(Census_Tract_ID, County_Name, Y2019, Y2023, Recovery_Ratio) %>%
  kable(digits = 2, col.names = c("Tract ID", "County", "2019", "2023", "Recovery Ratio")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Tract ID County 2019 2023 Recovery Ratio
42091203207 Montgomery County 152 25 0.16
42101038400 Philadelphia County 170 31 0.18
10003010105 New Castle County 195 38 0.19
42029302706 Chester County 123 24 0.20
42017101808 Bucks County 102 27 0.26
42101034600 Philadelphia County 761 217 0.29
42101037800 Philadelphia County 838 247 0.29
42091203000 Montgomery County 108 36 0.33
42101021700 Philadelphia County 716 251 0.35
42091205810 Montgomery County 485 179 0.37

Brief Feature Engineering

Temporal Lag Variable

Code
# Temporal lag feature for all tracts and years.
df_with_lag <- df_clean %>%
  filter(Day_of_Week == "Weekday") %>%
  arrange(Census_Tract_ID, Year) %>%
  group_by(Census_Tract_ID) %>%
  mutate(
    On_Lag1 = lag(On_, n = 1, order_by = Year),
    Off_Lag1 = lag(Off_, n = 1, order_by = Year)
  ) %>%
  ungroup()

# Check missing lag values in first year for each tract.
lag_missing <- df_with_lag %>%
  summarize(
    Total_Obs = n(),
    Missing_Lag = sum(is.na(On_Lag1)),
    Pct_Missing = Missing_Lag / Total_Obs * 100
  )
Code
# Temporal lag variable coverage.
lag_missing %>%
  kable(digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Total_Obs Missing_Lag Pct_Missing
7668 852 11.1
Code
# First few rows with lag variable.
df_with_lag %>%
  select(Census_Tract_ID, Year, On_, On_Lag1) %>%
  head(15) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Census_Tract_ID Year On_ On_Lag1
10003010105 2014 173 NA
10003010105 2015 163 173
10003010105 2016 195 163
10003010105 2017 228 195
10003010105 2018 180 228
10003010105 2019 195 180
10003010105 2021 119 195
10003010105 2022 33 119
10003010105 2023 38 33
10003010106 2014 0 NA
10003010106 2015 0 0
10003010106 2016 0 0
10003010106 2017 0 0
10003010106 2018 0 0
10003010106 2019 0 0
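One caveat is visible in the table above: because Fall 2020 is absent, `lag(..., order_by = Year)` silently treats 2019 as the prior wave for 2021. If a strict one-year lag is preferred, a gap-aware version can null out those jumps; a minimal dplyr sketch on toy data (column names hypothetical):

```r
library(dplyr)

# Toy tract series spanning the missing-2020 gap.
toy <- tibble(
  tract = "t1",
  Year  = c(2018, 2019, 2021, 2022),
  On_   = c(180, 195, 119, 33)
)

# Lag only when the previous observation is exactly one year earlier.
toy_lagged <- toy %>%
  arrange(tract, Year) %>%
  group_by(tract) %>%
  mutate(On_Lag1 = if_else(Year - lag(Year) == 1, lag(On_), NA_real_)) %>%
  ungroup()

toy_lagged$On_Lag1  # NA, 180, NA, 119: the 2019 -> 2021 jump becomes NA
```

Whether to keep the two-year jump as a valid lag or drop it is a modeling choice; keeping it mixes one-year and two-year dynamics in a single coefficient.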

Tract-Level Stats for FE

Code
# Tract-level summary stats for FE analysis.
tract_stats <- df_clean %>%
  filter(Day_of_Week == "Weekday") %>%
  group_by(Census_Tract_ID, County_Name) %>%
  summarize(
    Years_Observed = n_distinct(Year),
    Mean_Boardings = mean(On_, na.rm = TRUE),
    SD_Boardings = sd(On_, na.rm = TRUE),
    CV_Boardings = SD_Boardings / Mean_Boardings,
    Min_Year = min(Year),
    Max_Year = max(Year),
    .groups = "drop"
  ) %>%
  mutate(Full_Panel = Years_Observed == 10)

# Tract-level panel balance summary.
tract_stats %>%
  count(Years_Observed, name = "Number_of_Tracts") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Years_Observed Number_of_Tracts
9 852
Code
# Check for balanced panel structure.
balanced_pct <- sum(tract_stats$Full_Panel) / nrow(tract_stats) * 100
cat(sprintf("\nPercentage of tracts with complete 10-year panel: %.1f%%\n", balanced_pct))

Percentage of tracts with complete 10-year panel: 0.0%

No tract reaches 10 observed years because Fall 2020 is absent from the dataset entirely; every tract has at most 9 waves, so the panel is effectively balanced at 9 years rather than 10.

Coefficient of Variation

Code
# Coefficient of variation to identify stable vs. volatile tracts.
cv_plot <- ggplot(tract_stats %>% filter(Mean_Boardings > 10), 
                  aes(x = Mean_Boardings, y = CV_Boardings)) +
  geom_hex(bins = 40) +
  scale_fill_viridis_c(option = "magma", trans = "log10", labels = comma) +
  scale_x_log10(labels = comma) +
  scale_y_log10() +
  geom_hline(yintercept = 1, linetype = "dashed", color = "white") +
  labs(
    title = "Tract Volatility: Coefficient of Variation vs. Mean Ridership",
    subtitle = "Higher CV means less predictable ridership patterns requiring closer examination.",
    x = "Mean Weekday Boardings (2014-2023) [Log Scale]",
    y = "Coefficient of Variation (SD / Mean) [Log Scale]",
    fill = "Tract Count",
    caption = "Data: SEPTA Bus Ridership | CV = 1 Ref Line"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )

cv_plot

Data Quality and Missingness

Zero-Ridership Pattern

Code
# Percent of zero-ridership observations by year and county.
zero_patterns <- df_clean %>%
  mutate(Is_Zero = ifelse(On_ == 0 & Off_ == 0, 1, 0)) %>%
  group_by(Year, County_Name) %>%
  summarize(
    Total_Obs = n(),
    Zero_Count = sum(Is_Zero),
    Zero_Pct = Zero_Count / Total_Obs * 100,
    .groups = "drop"
  )

# Zero-ridership patterns across counties and time.
zero_heatmap <- ggplot(zero_patterns, aes(x = factor(Year), y = County_Name, fill = Zero_Pct)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_gradient2(low = "#2ECC71", mid = "#F39C12", high = "#E74C3C",
                       midpoint = 10, name = "% Zero Ridership") +
  geom_text(aes(label = sprintf("%.1f%%", Zero_Pct)), size = 3) +
  labs(
    title = "Zero-Ridership Patterns by County and Year",
    subtitle = "Percentage of tract-observations with no recorded bus boardings or deboardings.",
    x = "Year",
    y = "County",
    caption = "Data: SEPTA Bus Ridership"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

zero_heatmap

Service Coverage

Code
# Tracts with consistent service vs. sporadic coverage.
service_coverage <- df_clean %>%
  mutate(Has_Service = ifelse(On_ > 0 | Off_ > 0, 1, 0)) %>%
  group_by(Census_Tract_ID, County_Name) %>%
  summarize(
    Years_With_Service = sum(Has_Service),
    Total_Years = n_distinct(Year),
    Coverage_Rate = Years_With_Service / Total_Years * 100,
    .groups = "drop"
  ) %>%
  mutate(
    Coverage_Category = case_when(
      Coverage_Rate == 100 ~ "Consistent Service (100%)",
      Coverage_Rate >= 75 ~ "Mostly Served (75-99%)",
      Coverage_Rate >= 50 ~ "Intermittent Service (50-74%)",
      Coverage_Rate > 0 ~ "Minimal Service (1-49%)",
      TRUE ~ "No Service"
    )
  )

# Service coverage distribution.
service_coverage %>%
  count(Coverage_Category, name = "Tract_Count") %>%
  mutate(Percentage = round(Tract_Count / sum(Tract_Count) * 100, 1)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Coverage_Category Tract_Count Percentage
Consistent Service (100%) 16 1.9
Intermittent Service (50-74%) 3 0.4
Minimal Service (1-49%) 6 0.7
Mostly Served (75-99%) 827 97.1

Cross-Sectional Variation

Distribution by Day Type

Code
# Violin plot comparing distributions across day types.
day_violin <- df_clean %>%
  filter(On_ > 0) %>%
  ggplot(aes(x = Day_of_Week, y = log10(On_ + 1), fill = Day_of_Week)) +
  geom_violin(alpha = 0.7, draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_boxplot(width = 0.1, outlier.alpha = 0.3, alpha = 0.5) +
  scale_fill_manual(values = c("Weekday" = "#2E86AB", "Saturday" = "#A23B72", "Sunday" = "#F18F01")) +
  labs(
    title = "Ridership Distribution by Day Type",
    subtitle = "Weekdays show higher and more concentrated ridership patterns.",
    x = "Day Type",
    y = "log(Boardings + 1)",
    fill = "Day Type",
    caption = "Data: SEPTA Bus Ridership 2014-2023"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "none"
  )

day_violin

County Comparison Box Plots

Code
# Faceted box plots showing ridership distribution by county over time.
county_box <- df_clean %>%
  filter(Day_of_Week == "Weekday", On_ > 0) %>%
  ggplot(aes(x = factor(Year), y = log10(On_ + 1), fill = County_Name)) +
  geom_boxplot(outlier.alpha = 0.2, outlier.size = 0.5) +
  scale_fill_viridis_d(option = "turbo") +
  labs(
    title = "Ridership Distribution by County Across Years",
    subtitle = "Philadelphia County dominates ridership volume with different recovery patterns.",
    x = "Year",
    y = "log(Weekday Boardings + 1)",
    fill = "County",
    caption = "Data: SEPTA Weekday Bus Ridership"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  ) +
  facet_wrap(~County_Name, ncol = 2, scales = "free_y")

county_box

Correlation Analysis

Boardings vs. Deboardings Relationship

Code
# Correlation between boardings and deboardings.
on_off_cor <- cor(df_clean$On_, df_clean$Off_, use = "complete.obs")

# Scatter plot showing boarding-deboarding relationship.
on_off_scatter <- df_clean %>%
  filter(Day_of_Week == "Weekday", Year == 2023) %>%
  ggplot(aes(x = On_ + 1, y = Off_ + 1)) +
  geom_hex(bins = 50) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red", linewidth = 1) +
  scale_fill_viridis_c(option = "plasma", trans = "log10", labels = comma) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma) +
  annotate("text", x = 10, y = 10000,
           label = sprintf("Correlation: %.3f", on_off_cor),
           size = 5, color = "white", fontface = "bold") +
  labs(
    title = "Boardings vs. Deboardings Correlation (2023 Weekday)",
    subtitle = "Strong correlation suggests tracts serve both origins and destinations.",
    x = "Boardings (On_) [Log Scale]",
    y = "Deboardings (Off_) [Log Scale]",
    fill = "Density",
    caption = "Data: SEPTA Weekday Bus Ridership | Diagonal = Perfect Balance"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )

on_off_scatter

Code
cat(sprintf("Boardings-Deboardings Correlation: %.4f\n", on_off_cor))
Boardings-Deboardings Correlation: 0.9767

Correlation Matrix of Key Variables

Code
# Correlation matrix for numeric variables.
cor_data <- df_clean %>%
  filter(Day_of_Week == "Weekday") %>%
  # Is_Weekday is excluded: it is constant (all 1) within this weekday-only subset, so cor() emits a zero-SD warning.
  select(On_, Off_, Year) %>%
  cor(use = "complete.obs")

# Correlation matrix with color coding.
corrplot(cor_data, method = "color", type = "upper",
         addCoef.col = "black", number.cex = 0.8,
         tl.col = "black", tl.srt = 45,
         col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
         title = "Correlation Matrix: Key Ridership Variables",
         mar = c(0, 0, 2, 0))

Potential Train / Test Split Strategy?

Code
# Temporal train-test split for out-of-sample validation.
train_years <- 2014:2022
test_year <- 2023

# Training dataset for model development.
df_train <- df_clean %>%
  filter(Year %in% train_years, Day_of_Week == "Weekday")

# Test dataset for final validation.
df_test <- df_clean %>%
  filter(Year == test_year, Day_of_Week == "Weekday")

# Train/Test split summary.
cat(sprintf("Training Period: %d - %d\n", min(train_years), max(train_years)))
Training Period: 2014 - 2022
Code
cat(sprintf("Training Observations: %s\n", comma(nrow(df_train))))
Training Observations: 6,816
Code
cat(sprintf("\nTest Period: %d\n", test_year))

Test Period: 2023
Code
cat(sprintf("Test Observations: %s\n", comma(nrow(df_test))))
Test Observations: 852
Code
cat(sprintf("\nTraining Share of Weekday Panel: %.1f%%\n",
            nrow(df_train) / (nrow(df_train) + nrow(df_test)) * 100))

Training Share of Weekday Panel: 88.9%
Code
# Check for data leakage ensuring no overlap.
cat("\nData Leakage Check:\n")

Data Leakage Check:
Code
overlap <- intersect(
  paste(df_train$Census_Tract_ID, df_train$Year),
  paste(df_test$Census_Tract_ID, df_test$Year)
)

# Should be 0.
cat(sprintf("Overlapping Observations: %d\n", length(overlap)))
Overlapping Observations: 0

Key Stuff for Modeling

Summary Stats Table

Code
# Summary statistics table for potential features.
summary_table <- df_clean %>%
  filter(Day_of_Week == "Weekday") %>%
  summarize(
    Mean_On = mean(On_, na.rm = TRUE),
    Median_On = median(On_, na.rm = TRUE),
    SD_On = sd(On_, na.rm = TRUE),
    Min_On = min(On_, na.rm = TRUE),
    Max_On = max(On_, na.rm = TRUE),
    Zero_Count = sum(On_ == 0),
    Zero_Pct = Zero_Count / n() * 100,
    Unique_Tracts = n_distinct(Census_Tract_ID),
    Total_Obs = n()
  )

# Summary stats for weekday boardings.
summary_table %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Value = ifelse(Statistic %in% c("Mean_On", "Median_On", "SD_On", "Min_On", "Max_On"),
                        round(Value, 1), round(Value, 0))) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Statistic Value
Mean_On 546.1
Median_On 165.0
SD_On 1246.1
Min_On 0.0
Max_On 21739.0
Zero_Count 336.0
Zero_Pct 4.0
Unique_Tracts 852.0
Total_Obs 7668.0

Spatial Analysis

Code
# Load Pennsylvania census tract geometry for spatial visualization.
tract_geo <- st_read("data/pa_tracts/pa_tracts.shp")
Reading layer `pa_tracts' from data source 
  `C:\Users\Tess\Desktop\UPenn\UPenn_FW25\MUSA_5080-401_Public_Policy_Analytics\shark-tank\data\pa_tracts\pa_tracts.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 3446 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -80.51985 ymin: 39.7198 xmax: -74.68956 ymax: 42.51607
Geodetic CRS:  NAD83
Code
# Define FIPS codes for Philadelphia and surrounding counties.
septa_county_fips <- c("017", "029", "045", "091", "101")

# Alternative: Philadelphia only.
# septa_county_fips <- c("101")

# Transform to WGS84 coordinate system and filter to SEPTA service area.
final_map_base <- tract_geo %>%
  st_transform(4326) %>%
  filter(COUNTYFP %in% septa_county_fips)

# Calculate total ridership by census tract for Fall 2023.
current_ridership <- df_clean %>%
  filter(Year == 2023) %>%
  group_by(Census_Tract_ID) %>%
  summarize(
    Final_On_Count = sum(On_),
    Final_Off_Count = sum(Off_)
  ) %>%
  mutate(Census_Tract_ID = as.character(Census_Tract_ID))

# Join ridership data to geometry for spatial visualization.
tract_map_data <- final_map_base %>%
  left_join(current_ridership, by = c("GEOID" = "Census_Tract_ID"))

# Filter to only tracts with recorded ridership data.
tract_map_data_with_ridership <- tract_map_data %>%
  filter(!is.na(Final_On_Count))

# Load SEPTA transit route geometry for network visualization.
routes <- st_read("data/septa/Transit_Routes_Spring_2025/Transit_Routes.shp")
Reading layer `Transit_Routes' from data source 
  `C:\Users\Tess\Desktop\UPenn\UPenn_FW25\MUSA_5080-401_Public_Policy_Analytics\shark-tank\data\septa\Transit_Routes_Spring_2025\Transit_Routes.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 178 features and 18 fields
Geometry type: MULTILINESTRING
Dimension:     XY
Bounding box:  xmin: -8441925 ymin: 4837473 xmax: -8321361 ymax: 4914725
Projected CRS: WGS 84 / Pseudo-Mercator
Code
# Create choropleth map of boarding counts using log scale transformation.
spatial_plot_log_on <- ggplot(tract_map_data_with_ridership) +
  geom_sf(aes(fill = Final_On_Count + 1), color = "#ff4100", linewidth = 0.25) +
  scale_fill_viridis_c(
    name = "Avg Daily Boardings",
    trans = "log10",
    labels = comma,
    na.value = "gray",
    option = "mako",
    direction = -1
  ) +
  labs(
    title = "Spatial Distribution of Bus Ridership (Fall 2023)",
    subtitle = "Geographic clustering of log(Rider On Count + 1)."
  ) +
  theme_void() +
  theme(plot.background = element_rect(fill = "#dedbd4", color = "#dedbd4"))

# Create choropleth map of alighting counts using log scale transformation.
spatial_plot_log_off <- ggplot(tract_map_data_with_ridership) +
  geom_sf(aes(fill = Final_Off_Count + 1), color = "#ff4100", linewidth = 0.25) +
  scale_fill_viridis_c(
    name = "Avg Daily Alightings",
    trans = "log10",
    labels = comma,
    na.value = "gray",
    option = "mako",
    direction = -1
  ) +
  labs(
    title = "Spatial Distribution of Bus Ridership (Fall 2023)",
    subtitle = "Geographic clustering of log(Rider Off Count + 1)."
  ) +
  theme_void() +
  theme(plot.background = element_rect(fill = "#dedbd4", color = "#dedbd4"))

# Create map showing complete SEPTA route network over census tracts.
septa_route_map <- ggplot(tract_map_data_with_ridership) +
  geom_sf(fill = NA, color = "#ff4100", linewidth = 0.25) +
  geom_sf(data = routes, color = "#778ac5", fill = NA, linewidth = 0.5) +
  labs(
    title = "SEPTA Routes",
    subtitle = "All SEPTA Routes"
  ) +
  theme_void() +
  theme(plot.background = element_rect(fill = "#dedbd4", color = "#dedbd4"))

# Combine all three maps vertically with shared legend at bottom.
(septa_route_map / spatial_plot_log_on / spatial_plot_log_off) +
  plot_layout(guides = "collect") &
  theme(
    legend.position = "bottom",
    legend.direction = "horizontal"
  )

Key Things for Negative Binomial Model

Overdispersion Justification

  • Variance/Mean Ratio: 2365.82 > 1.
  • Negative Binomial model handles overdispersion better than Poisson.
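A minimal sketch of the overdispersion check, on simulated counts shaped roughly like tract-level boardings (the numbers here are fabricated for illustration, not the project data):

```r
# Simulate heavy-tailed counts resembling On_ and compare variance to mean.
set.seed(42)
sim_on <- rnbinom(852, mu = 546, size = 0.25)

dispersion_ratio <- var(sim_on) / mean(sim_on)
dispersion_ratio  # far above 1, violating the Poisson assumption (variance == mean)

# Quick formal check: fit an intercept-only Poisson, inspect the Pearson
# dispersion statistic; values >> 1 favor the Negative Binomial family.
fit_pois <- glm(sim_on ~ 1, family = poisson)
pearson_disp <- sum(residuals(fit_pois, type = "pearson")^2) / fit_pois$df.residual
pearson_disp
```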

Temporal Autocorrelation

  • Lag-1 Correlation: 0.9593.
  • Strong temporal persistence justifies On_Lag1 feature.
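A base-R sketch of the lag-1 autocorrelation computation on a simulated tract-year panel (column names mirror the real data; the values are fake):

```r
# Build a toy panel: 50 tracts x 10 years, with strong within-tract persistence.
set.seed(7)
panel <- expand.grid(Census_Tract_ID = 1:50, Year = 2014:2023)
panel <- panel[order(panel$Census_Tract_ID, panel$Year), ]
panel$On_ <- as.vector(replicate(50, cumsum(rpois(10, 50))))

# Lag each tract's series by one year, then correlate current vs. lagged.
lagged <- ave(panel$On_, panel$Census_Tract_ID,
              FUN = function(x) c(NA, head(x, -1)))
lag1_cor <- cor(panel$On_, lagged, use = "complete.obs")
lag1_cor  # strong positive persistence motivates an On_Lag1 feature
```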

Spatial Heterogeneity

  • Philadelphia County: 47.2% of observations.
  • Wide variation in recovery ratios across tracts.
  • Fixed effects needed to control tract-level unobservables.

Structural Break

  • 2023 Recovery Rate: 76.0% of 2019 baseline.
  • Pandemic created regime change requiring post-2020 analysis.

Model Specification

  • Outcome Variable: On_ (Average Daily Boardings).
  • Model Family: Negative Binomial GLM.
  • Key Features:
    • On_Lag1: Temporal lag (ridership inertia).
    • Spatial_Lag: Neighborhood spillover effects.
    • Tract Fixed Effects: Control unobserved heterogeneity.
    • Year Fixed Effects: Control macro shocks.

Validation Strategy?

  • Temporal Train/Test Split: 2014-2022 train, 2023 test.
  • No random split to avoid data leakage.
  • Goal: Predict new normal after pandemic.

Potential Feature Engineering Pipeline?

Adapted from Dr. Delmelle’s instructions and reorganized here so the pipeline is easier to follow.

Temporal Features

  • On_Lag1 (t-1 ridership).
  • On_Lag2 (t-2 ridership) for robustness check.
  • Rolling averages (3-year, 5-year).
  • Growth rate features (YoY change).
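A base-R sketch of these temporal features on a single toy tract series (the real pipeline would group by Census_Tract_ID first; column names are illustrative):

```r
# One tract's simulated annual series.
set.seed(3)
tract_series <- data.frame(Year = 2014:2023, On_ = rpois(10, 500))

# Lags: previous-year and two-year-prior boardings.
tract_series$On_Lag1 <- c(NA, head(tract_series$On_, -1))
tract_series$On_Lag2 <- c(NA, NA, head(tract_series$On_, -2))

# Trailing 3-year rolling mean (NA until 3 observations exist).
tract_series$On_Roll3 <- as.numeric(
  stats::filter(tract_series$On_, rep(1 / 3, 3), sides = 1)
)

# Year-over-year growth rate.
tract_series$On_YoY <- tract_series$On_ / tract_series$On_Lag1 - 1
```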

Spatial Features

  • Spatial weights matrix (queen contiguity).
  • Spatial lag: W_On_Lag1 (neighbor average).
  • Distance to nearest Regional Rail station.
  • Buffer features of amenity counts within 0.5 mile? Could use osmdata library.
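A hand-built sketch of the spatial lag idea on a toy 4-tract adjacency matrix (on the real tracts the queen-contiguity matrix would come from something like spdep::poly2nb; here W is fabricated):

```r
# Toy adjacency matrix: W[i, j] = 1 if tracts i and j share a border.
W <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Row-standardize so each spatial lag is the mean of neighbors' values.
W_row <- W / rowSums(W)

on_lag1 <- c(100, 300, 500, 200)             # last year's boardings per tract
spatial_lag <- as.vector(W_row %*% on_lag1)  # W_On_Lag1: neighbor average
spatial_lag                                  # 400, 300, 200, 500
```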

Demographic Features

  • Population, median income, vehicle ownership.
  • Could make a transit dependency index?
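One way a transit dependency index could work: average z-scores of tract covariates, with signs flipped so higher means more transit-dependent. Everything below is fabricated; real inputs would come from ACS tables.

```r
# Illustrative tract covariates (not real ACS values).
acs <- data.frame(
  median_income  = c(32000, 85000, 51000),
  pct_no_vehicle = c(0.45, 0.08, 0.22),
  pop_density    = c(12000, 2500, 7000)
)

z <- scale(acs)  # column-wise z-scores

# Low income, low vehicle access, and high density all raise the index.
dependency_index <- as.vector(
  -z[, "median_income"] + z[, "pct_no_vehicle"] + z[, "pop_density"]
) / 3
dependency_index  # tract 1 scores highest
```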

Model Estimation

  • Negative Binomial regression with MASS library’s glm.nb function.
  • Tract and year fixed effects could be useful?
  • Standard errors.
  • Diagnostics (residual plots, dispersion parameter).

Validation / Interpretation

  • Predict 2023 ridership on test set.
  • Calculate MAE, RMSE, and prediction intervals.
  • Tracts with largest prediction errors.
  • Map of predicted vs. actual ridership.
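The error metrics are straightforward; a sketch on fabricated predictions:

```r
# Fabricated actual vs. predicted 2023 boardings for five tracts.
actual    <- c(120, 560, 90, 2100, 430)
predicted <- c(150, 500, 80, 1900, 460)

mae  <- mean(abs(actual - predicted))       # mean absolute error
rmse <- sqrt(mean((actual - predicted)^2))  # root mean squared error
c(MAE = mae, RMSE = rmse)
```

Tracts with the largest residuals would then be flagged and mapped against actual ridership.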

Policy Recommendations

  • Rank tracts by predicted 2025 ridership growth.
  • Identify service expansion priorities.
  • Equity implications of recommendations.

Session Info

Code
# R session information for reproducibility.
sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.50        ggridges_0.5.7    corrplot_0.95     kableExtra_1.4.0 
 [5] viridis_0.6.5     viridisLite_0.4.2 patchwork_1.3.2   scales_1.4.0     
 [9] sf_1.0-21         lubridate_1.9.4   forcats_1.0.0     stringr_1.5.1    
[13] dplyr_1.1.4       purrr_1.1.0       readr_2.1.5       tidyr_1.3.1      
[17] tibble_3.3.0      ggplot2_3.5.2     tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] generics_0.1.4     xml2_1.4.0         class_7.3-23       KernSmooth_2.23-26
 [5] lattice_0.22-7     stringi_1.8.7      hms_1.1.3          digest_0.6.37     
 [9] magrittr_2.0.3     evaluate_1.0.4     grid_4.5.1         timechange_0.3.0  
[13] RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0     e1071_1.7-16      
[17] DBI_1.2.3          gridExtra_2.3      textshaping_1.0.1  cli_3.6.5         
[21] rlang_1.1.6        units_0.8-7        withr_3.0.2        yaml_2.3.10       
[25] tools_4.5.1        tzdb_0.5.0         vctrs_0.6.5        R6_2.6.1          
[29] proxy_0.4-27       lifecycle_1.0.4    classInt_0.4-11    htmlwidgets_1.6.4 
[33] pkgconfig_2.0.3    hexbin_1.28.5      pillar_1.11.1      gtable_0.3.6      
[37] glue_1.8.0         Rcpp_1.1.0         systemfonts_1.2.3  xfun_0.53         
[41] tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2       htmltools_0.5.8.1 
[45] labeling_0.4.3     svglite_2.2.1      rmarkdown_2.29     compiler_4.5.1    

Potential Model Data?

Code
# Export cleaned data with temporal lags for modeling phase.
df_model_ready <- df_clean %>%
  filter(Day_of_Week == "Weekday") %>%
  arrange(Census_Tract_ID, Year) %>%
  group_by(Census_Tract_ID) %>%
  mutate(
    On_Lag1 = lag(On_, n = 1, order_by = Year),
    Off_Lag1 = lag(Off_, n = 1, order_by = Year)
  ) %>%
  ungroup()

# Save data for modeling.
# write_csv(df_model_ready, "data/septa_processed_for_modeling.csv")

cat(sprintf("Model observations: %s\n", comma(nrow(df_model_ready))))
Model observations: 7,668
Code
cat(sprintf("Variables: %d\n", ncol(df_model_ready)))
Variables: 15

Key Variables?

  • On_: Current year boardings (outcome).
  • On_Lag1: Previous year boardings (temporal lag).
  • Census_Tract_ID: Tract identifier (for fixed effects).
  • Year: Time period (for fixed effects).
  • County_Name: County identifier.

Additional Viz

Integrating More SEPTA Data

  • Stop-level data (Fall 2023): Granular boarding patterns at individual bus stops.
  • Route performance (2019-2024): Temporal trends for individual bus routes.
  • Modal comparison (2019-2024): Bus vs. Regional Rail recovery.
  • Regional Rail stations: Potential bus-rail transfer points and competition.

Helps identify high-performing routes for service replication, understand bus-rail modal shifts, stop-level hotspot analysis for micro-interventions, check tract-level patterns with finer ridership.

Stop Analysis

Load Stop-Level Data

Code
# Load Fall 2023 bus stop summary with geographic coordinates.
df_stops <- read.csv("data/septa/Fall_2023_Stop_Summary_Bus.csv")

# Clean and prepare stop data with total ridership calculations.
df_stops_clean <- df_stops %>%
  mutate(
    # Total boardings across all day types.
    Total_Boardings = Weekday_On + Saturday_O + Sunday_Ons,
    Total_Alightings = Weekday_Of + Saturday_1 + Sunday_Off,
    Total_Activity = Total_Boardings + Total_Alightings,
    # Dominant day type indicator.
    Weekday_Dominance = Weekday_On / (Weekday_On + Saturday_O + Sunday_Ons + 0.001),
    # Extract numeric route IDs; lettered routes (e.g. "G", "L") become NA.
    Route_Numeric = as.numeric(ifelse(grepl("^[0-9]+$", Route), Route, NA))
  ) %>%
  # Filter stops with no activity.
  filter(Total_Activity > 0)

# Summary stats for stop-level data.
cat(sprintf("Total Stops: %s\n", comma(n_distinct(df_stops_clean$Stop_Code))))
Total Stops: 12,126
Code
cat(sprintf("Total Routes: %s\n", n_distinct(df_stops_clean$Route)))
Total Routes: 124
Code
cat(sprintf("Mean Weekday Boardings per Stop: %.1f\n", mean(df_stops_clean$Weekday_On)))
Mean Weekday Boardings per Stop: 22.5
Code
cat(sprintf("Median Weekday Boardings per Stop: %.1f\n", median(df_stops_clean$Weekday_On)))
Median Weekday Boardings per Stop: 6.0

Top 50 Highest-Volume Bus Stops

Code
# Top 50 stops by weekday boarding activity (Fall 2023 Weekday Boardings).
top_stops <- df_stops_clean %>%
  arrange(desc(Weekday_On)) %>%
  head(50) %>%
  mutate(Rank = row_number())

top_stops %>%
  head(10) %>%
  select(Rank, Stop, Route, Weekday_On, Total_Activity, Lat, Lon) %>%
  kable(digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Rank Stop Route Weekday_On Total_Activity Lat Lon
1 Olney Transportation Center - Rt L L 1498 2641 40 -75.1
2 Olney Transportation Center - Rt 6 6 1496 2807 40 -75.1
3 Frankford Transportation Center-14-BLVDDIR-MFO 14 1349 2692 40 -75.1
4 Frankford Transportation Center - Rt 58 58 1271 2370 40 -75.1
5 Olney Av & Broad St 18 963 3954 40 -75.1
6 11th St & Market St 23 915 1858 40 -75.2
7 56th St & Market St - FS G 909 2964 40 -75.2
8 Olney Transportation Center - Rt 55 55 901 1881 40 -75.1
9 69th St Transportation Center West Terminal 104 893 1929 40 -75.3
10 69th St Transportation Center North Terminal 65 876 1715 40 -75.3

Stop Distribution Analysis

Code
# Histogram of weekday boardings distribution across all stops.
stop_dist_plot <- df_stops_clean %>%
  filter(Weekday_On > 0) %>%
  ggplot(aes(x = log10(Weekday_On + 1))) +
  geom_histogram(bins = 50, fill = "#3498DB", color = "white", alpha = 0.8) +
  geom_vline(xintercept = log10(median(df_stops_clean$Weekday_On[df_stops_clean$Weekday_On > 0]) + 1),
             linetype = "dashed", color = "#E74C3C", linewidth = 1) +
  annotate("text", 
           x = log10(median(df_stops_clean$Weekday_On[df_stops_clean$Weekday_On > 0]) + 1), 
           y = Inf,
           label = sprintf("Median = %.0f", median(df_stops_clean$Weekday_On[df_stops_clean$Weekday_On > 0])),
           vjust = 2, hjust = 1.1, color = "#E74C3C", fontface = "bold") +
  labs(
    title = "Distribution of Weekday Boardings Across Bus Stops",
    subtitle = "High concentration in a few stops suggests strategic intervention points.",
    x = "log(Weekday Boardings + 1)",
    y = "Number of Stops",
    caption = "Data: SEPTA Fall 2023 Bus Stop Summary | Positive Boardings Only"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

stop_dist_plot

Code
# Concentration stats.
total_boardings <- sum(df_stops_clean$Weekday_On)
top_10pct_threshold <- quantile(df_stops_clean$Weekday_On, 0.90)
top_10pct_boardings <- sum(df_stops_clean$Weekday_On[df_stops_clean$Weekday_On >= top_10pct_threshold])
concentration_pct <- top_10pct_boardings / total_boardings * 100

cat(sprintf("Top 10%% of stops account for %.1f%% of all weekday boardings\n", concentration_pct))
Top 10% of stops account for 59.7% of all weekday boardings

Weekday Dominance

Code
# Proportion of activity occurring on weekdays vs. weekends.
dominance_plot <- df_stops_clean %>%
  filter(Total_Boardings > 10) %>%
  ggplot(aes(x = Weekday_Dominance)) +
  geom_histogram(bins = 50, fill = "#9B59B6", color = "white", alpha = 0.8) +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "black", linewidth = 1) +
  annotate("text", x = 0.5, y = Inf, label = "Equal Split", 
           vjust = 2, hjust = 1.1, fontface = "bold") +
  scale_x_continuous(labels = percent) +
  labs(
    title = "Weekday Ridership Dominance at Bus Stops",
    subtitle = "Most stops are weekday-dominant, reflecting commuting patterns.",
    x = "Proportion of Total Boardings on Weekdays",
    y = "Number of Stops",
    caption = "Data: SEPTA Fall 2023 | Stops with 10+ Total Boardings"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

dominance_plot

Code
# Weekday dominance stats.
cat("\nWeekday Dominance Statistics:\n")

Weekday Dominance Statistics:
Code
cat(sprintf("Mean Weekday Proportion: %.1f%%\n", mean(df_stops_clean$Weekday_Dominance) * 100))
Mean Weekday Proportion: 52.0%
Code
cat(sprintf("Stops with >80%% Weekday Activity: %.1f%%\n", 
            sum(df_stops_clean$Weekday_Dominance > 0.8) / nrow(df_stops_clean) * 100))
Stops with >80% Weekday Activity: 16.1%

Stop Spatial Hotspots

Code
# Bounding box.
min_lon <- min(df_stops_clean$Lon, na.rm = TRUE)
max_lon <- max(df_stops_clean$Lon, na.rm = TRUE)
min_lat <- min(df_stops_clean$Lat, na.rm = TRUE)
max_lat <- max(df_stops_clean$Lat, na.rm = TRUE)

# Spatial point map showing high-volume stops.
stop_spatial_plot <- df_stops_clean %>%
  filter(Total_Boardings > 0) %>%
  ggplot() +
  geom_sf(data = tract_map_data, fill = NA, color = "gray60", linewidth = 0.25) +
  geom_point(
    aes(x = Lon, y = Lat, size = Weekday_On, color = log10(Weekday_On + 1)),
    alpha = 0.6
    ) +
  scale_size_continuous(range = c(0.5, 8), labels = comma, name = "Weekday Boardings") +
  scale_color_viridis_c(option = "plasma", direction = -1, labels = comma, name = "log(Boardings)") +
  labs(
    title = "Bus Stop Ridership Spatial Hotspots (Fall 2023)",
    subtitle = "Point size and color indicate weekday boarding intensity.",
    x = "Longitude",
    y = "Latitude",
    caption = "Data: SEPTA Bus Stop Summary Fall 2023"
    ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right",
    panel.grid = element_line(color = "gray90")
    ) +
  coord_sf(
    xlim = c(min_lon, max_lon), 
    ylim = c(min_lat, max_lat)
  )

stop_spatial_plot

Route Stop Performance

Code
# Route-level statistics from stop data.
route_stop_stats <- df_stops_clean %>%
  filter(!is.na(Route_Numeric)) %>%
  group_by(Route) %>%
  summarize(
    N_Stops = n_distinct(Stop_Code),
    Total_Weekday_Boardings = sum(Weekday_On),
    Mean_Boardings_Per_Stop = mean(Weekday_On),
    Median_Boardings_Per_Stop = median(Weekday_On),
    Max_Boardings = max(Weekday_On),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Weekday_Boardings)) %>%
  head(20)

# Top 20 routes by total weekday boardings.
route_bar_plot <- ggplot(route_stop_stats, 
                         aes(x = reorder(Route, Total_Weekday_Boardings), 
                             y = Total_Weekday_Boardings)) +
  geom_col(aes(fill = Total_Weekday_Boardings), width = 0.7) +
  coord_flip() +
  scale_fill_gradient(low = "#BDC3C7", high = "#2980B9", labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Top 20 Bus Routes by Total Weekday Boardings (Fall 2023)",
    subtitle = "Aggregated from stop-level data showing route-level demand.",
    x = "Route Number",
    y = "Total Weekday Boardings Across All Stops",
    fill = "Total Boardings",
    caption = "Data: SEPTA Fall 2023 Stop Summary"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

route_bar_plot

Route Temporal Analysis

Load Route Performance Data

Code
# Load average daily ridership by route for temporal analysis.
df_route <- read.csv("data/septa/Average_Daily_Ridership_By_Route.csv")

# Clean and prepare route data.
df_route_clean <- df_route %>%
  mutate(
    Date = make_date(Calendar_Year, Calendar_Month),
    Year = Calendar_Year,
    Month = Calendar_Month,
    Route_Char = as.character(Route)
  ) %>%
  filter(!is.na(Date), !is.na(Average_Daily_Ridership))

# Summary of route data coverage.
cat(sprintf("Date Range: %s to %s\n", min(df_route_clean$Date), max(df_route_clean$Date)))
Date Range: 2019-01-01 to 2025-09-01
Code
cat(sprintf("Unique Routes: %s\n", n_distinct(df_route_clean$Route)))
Unique Routes: 141
Code
cat(sprintf("Total Observations: %s\n", comma(nrow(df_route_clean))))
Total Observations: 10,994

Top Routes Recovery Analysis

Code
# Recovery ratio for each route comparing 2023 to 2019.
route_recovery <- df_route_clean %>%
  filter(Month %in% c(9, 10, 11), Year %in% c(2019, 2023)) %>%
  group_by(Route, Year) %>%
  summarize(Avg_Ridership = mean(Average_Daily_Ridership), .groups = "drop") %>%
  pivot_wider(names_from = Year, values_from = Avg_Ridership, names_prefix = "Y") %>%
  filter(!is.na(Y2019) & !is.na(Y2023), Y2019 > 100) %>%
  mutate(
    Recovery_Ratio = Y2023 / Y2019,
    Recovery_Category = case_when(
      Recovery_Ratio >= 1.1 ~ "Strong Growth (>110%)",
      Recovery_Ratio >= 1.0 ~ "Full Recovery (100-110%)",
      Recovery_Ratio >= 0.8 ~ "Partial Recovery (80-100%)",
      TRUE ~ "Weak Recovery (<80%)"
    )
  ) %>%
  arrange(desc(Recovery_Ratio))

route_recovery %>%
  count(Recovery_Category, name = "N_Routes") %>%
  mutate(Pct_Routes = N_Routes / sum(N_Routes) * 100) %>%
  kable(digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Recovery_Category N_Routes Pct_Routes
Full Recovery (100-110%) 2 1.5
Partial Recovery (80-100%) 19 14.5
Weak Recovery (<80%) 110 84.0

Top Recovering Routes Viz

Code
# Top 15 recovering routes.
top_recovery_plot <- route_recovery %>%
  head(15) %>%
  ggplot(aes(x = reorder(Route, Recovery_Ratio), y = Recovery_Ratio, fill = Recovery_Category)) +
  geom_col(width = 0.7) +
  geom_hline(yintercept = 1.0, linetype = "dashed", color = "black", linewidth = 0.8) +
  coord_flip() +
  scale_fill_manual(values = c(
    "Strong Growth (>110%)" = "#27AE60",
    "Full Recovery (100-110%)" = "#F39C12",
    "Partial Recovery (80-100%)" = "#E67E22",
    "Weak Recovery (<80%)" = "#E74C3C"
  )) +
  labs(
    title = "Top 15 Bus Routes by Recovery Ratio (Fall 2023 vs. Fall 2019)",
    subtitle = "Routes exceeding 100% indicate stronger demand than pre-pandemic baseline.",
    x = "Route Number",
    y = "Recovery Ratio (Fall 2023 / Fall 2019)",
    fill = "Recovery Category",
    caption = "Data: SEPTA Average Daily Ridership | Minimum 100 Riders in 2019"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

top_recovery_plot

Route Recovery Distribution

Code
# Histogram showing distribution of recovery ratios.
recovery_dist_plot <- ggplot(route_recovery, aes(x = Recovery_Ratio)) +
  geom_histogram(bins = 40, fill = "#3498DB", color = "white", alpha = 0.8) +
  geom_vline(xintercept = 1.0, linetype = "dashed", color = "#E74C3C", linewidth = 1) +
  geom_vline(xintercept = median(route_recovery$Recovery_Ratio), 
             linetype = "dashed", color = "gray30", linewidth = 1) +
  annotate("text", x = 1.0, y = Inf, label = "Full Recovery", 
           vjust = 2, hjust = 1.1, color = "#E74C3C", fontface = "bold") +
  annotate("text", x = median(route_recovery$Recovery_Ratio), y = Inf, 
           label = sprintf("Median = %.2f", median(route_recovery$Recovery_Ratio)), 
           vjust = 2, hjust = -0.1, color = "gray30", fontface = "bold") +
  scale_x_continuous(breaks = seq(0, 2, 0.2)) +
  labs(
    title = "Distribution of Route-Level Recovery Ratios",
    subtitle = "Most routes have not returned to pre-pandemic ridership levels.",
    x = "Recovery Ratio (Fall 2023 / Fall 2019)",
    y = "Number of Routes",
    caption = "Data: SEPTA Average Daily Ridership | Routes with 100+ Riders in 2019"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

recovery_dist_plot

Individual Route Time Series (Top 5)

Code
# Top 5 routes by 2019 ridership for detailed temporal analysis.
top5_routes_2019 <- df_route_clean %>%
  filter(Year == 2019) %>%
  group_by(Route) %>%
  summarize(Avg_2019 = mean(Average_Daily_Ridership), .groups = "drop") %>%
  arrange(desc(Avg_2019)) %>%
  head(5) %>%
  pull(Route)

# Filter data for top 5 routes.
df_route_top5 <- df_route_clean %>%
  filter(Route %in% top5_routes_2019)

# Faceted time series plot.
route_ts_plot <- ggplot(df_route_top5, aes(x = Date, y = Average_Daily_Ridership)) +
  geom_line(color = "#2980B9", linewidth = 0.8) +
  geom_vline(xintercept = as.Date("2020-03-01"), 
             linetype = "dashed", color = "#E74C3C", alpha = 0.7) +
  facet_wrap(~Route, scales = "free_y", ncol = 2) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Temporal Trends for Top 5 Bus Routes (2019-2025)",
    subtitle = "Dashed red line marks pandemic onset (March 2020); recovery trajectories vary by route.",
    x = "Date (Month)",
    y = "Average Daily Ridership",
    caption = "Data: SEPTA Average Daily Ridership By Route"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(face = "bold", size = 12)
  )

route_ts_plot

Bus vs. Regional Rail

Load Modal Data

Code
# Load average daily ridership by mode for cross-modal analysis.
df_mode <- read.csv("data/septa/Average_Daily_Ridership_By_Mode.csv")

# Clean and prepare modal data.
df_mode_clean <- df_mode %>%
  mutate(
    Date = make_date(Calendar_Year, Calendar_Month),
    Year = Calendar_Year,
    Month = Calendar_Month
  ) %>%
  filter(!is.na(Date), !is.na(Average_Daily_Ridership)) %>%
  filter(Mode %in% c("Bus", "Regional Rail", "Heavy Rail"))

# Summary of modal data.
df_mode_clean %>%
  group_by(Mode) %>%
  summarize(
    Mean_Ridership = mean(Average_Daily_Ridership),
    Min_Date = min(Date),
    Max_Date = max(Date),
    .groups = "drop"
  ) %>%
  kable(digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mode Mean_Ridership Min_Date Max_Date
Bus 328238 2019-01-01 2025-04-01
Heavy Rail 174765 2019-01-01 2025-04-01
Regional Rail 66871 2019-01-01 2025-04-01

Recovery Index by Mode

Code
# Recovery index relative to 2019 baseline for each mode.
df_mode_index <- df_mode_clean %>%
  group_by(Mode) %>%
  mutate(Baseline_2019 = mean(Average_Daily_Ridership[Year == 2019], na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(Recovery_Index = (Average_Daily_Ridership / Baseline_2019) * 100) %>%
  filter(Year >= 2019, Year <= 2024)

# Recovery index plot.
recovery_index_plot <- ggplot(df_mode_index, aes(x = Date, y = Recovery_Index, color = Mode)) +
  geom_line(linewidth = 1.3, alpha = 0.9) +
  geom_hline(yintercept = 100, linetype = "dashed", color = "gray30", linewidth = 0.8) +
  annotate("text", x = min(df_mode_index$Date), y = 105, 
           label = "Pre-Pandemic Baseline (100)", hjust = 0, fontface = "bold") +
  geom_vline(xintercept = as.Date("2020-03-01"), 
             linetype = "dotted", color = "gray50", alpha = 0.7) +
  scale_color_manual(values = c("Bus" = "#E74C3C", "Regional Rail" = "#3498DB", "Heavy Rail" = "#9B59B6")) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "SEPTA Recovery Index by Mode (2019 = 100)",
    subtitle = "Bus shows faster recovery than Regional Rail, indicating possible modal shift.",
    x = "Date (Month)",
    y = "Ridership Index (2019 = 100)",
    color = "Mode",
    caption = "Data: SEPTA Average Daily Ridership By Mode"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

recovery_index_plot

Code
# Current recovery percentages in most recent month.
df_mode_index %>%
  filter(Date == max(Date)) %>%
  select(Mode, Recovery_Index) %>%
  arrange(desc(Recovery_Index)) %>%
  mutate(Recovery_Index = sprintf("%.1f%%", Recovery_Index)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mode Recovery_Index
Bus 76.5%
Heavy Rail 67.8%
Regional Rail 65.4%

Bus vs. Regional Rail Correlation

Code
# Pivot data for bus vs. rail comparison.
df_mode_pivot <- df_mode_clean %>%
  filter(Mode %in% c("Bus", "Regional Rail")) %>%
  select(Date, Mode, Average_Daily_Ridership) %>%
  pivot_wider(names_from = Mode, values_from = Average_Daily_Ridership) %>%
  filter(!is.na(Bus) & !is.na(`Regional Rail`))

# Calculate correlation.
bus_rail_cor <- cor(df_mode_pivot$Bus, df_mode_pivot$`Regional Rail`, use = "complete.obs")

# Scatter plot showing relationship.
bus_rail_scatter <- ggplot(df_mode_pivot, aes(x = `Regional Rail`, y = Bus)) +
  geom_point(aes(color = year(Date)), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "#E74C3C", linewidth = 1.2) +
  scale_color_viridis_c(option = "turbo", name = "Year") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  annotate("text", x = min(df_mode_pivot$`Regional Rail`) * 1.05, 
           y = max(df_mode_pivot$Bus) * 0.95,
           label = sprintf("Correlation: %.3f", bus_rail_cor),
           size = 5, fontface = "bold", hjust = 0) +
  labs(
    title = "Bus vs. Regional Rail Ridership Correlation",
    subtitle = "Strong positive correlation suggests system-wide demand drivers.",
    x = "Regional Rail Average Daily Ridership",
    y = "Bus Average Daily Ridership",
    caption = "Data: SEPTA Average Daily Ridership 2019-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

bus_rail_scatter

Regional Rail Stations

Load Regional Rail Data

Code
# Load regional rail station summary for spatial rail-bus interaction.
df_rr <- read.csv("data/septa/Regional_Rail_Station_Summary.csv")

# Clean and prepare regional rail data.
df_rr_clean <- df_rr %>%
  filter(!is.na(Boards), !is.na(Leaves)) %>%
  mutate(
    Total_Activity = Boards + Leaves,
    Station_Line = paste0(Station, " (", Line, ")")
  )

# Summary stats.
cat(sprintf("Unique Stations: %s\n", n_distinct(df_rr_clean$Station)))
Unique Stations: 155
Code
cat(sprintf("Unique Lines: %s\n", n_distinct(df_rr_clean$Line)))
Unique Lines: 16
Code
cat(sprintf("Years Covered: %s to %s\n", min(df_rr_clean$Year), max(df_rr_clean$Year)))
Years Covered: 2017 to 2024

Station Commuter Score

Code
# Commuter score based on weekday vs. weekend balance.
rr_station_balance <- df_rr_clean %>%
  filter(Year == 2023) %>%
  group_by(Station, Line, Service_Ty, Latitude, Longitude) %>%
  summarize(Total_Activity = sum(Boards + Leaves), .groups = "drop") %>%
  pivot_wider(names_from = Service_Ty, values_from = Total_Activity, values_fill = 0) %>%
  rename(Weekend_Activity = Saturday) %>%
  mutate(
    Commuter_Score = (Weekday - Weekend_Activity) / (Weekday + Weekend_Activity + 1),
    Station_Type = case_when(
      Commuter_Score > 0.7 ~ "Strong Commuter",
      Commuter_Score > 0.4 ~ "Mixed Use",
      TRUE ~ "All-Day/Leisure"
    )
  ) %>%
  filter(Weekday + Weekend_Activity > 100)

# Regional rail station commuter classification (2023).
rr_station_balance %>%
  count(Station_Type, name = "N_Stations") %>%
  mutate(Pct = N_Stations / sum(N_Stations) * 100) %>%
  kable(digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Station_Type N_Stations Pct
All-Day/Leisure 37 22.2
Mixed Use 93 55.7
Strong Commuter 37 22.2

Station Commuter Score Viz

Code
# Scatter plot showing weekday vs. weekend activity.
station_balance_plot <- ggplot(rr_station_balance, 
                               aes(x = Weekday + 1, y = Weekend_Activity + 1, 
                                   color = Commuter_Score)) +
  geom_point(size = 4, alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray30", linewidth = 0.8) +
  scale_color_gradient2(
    low = "#27AE60", mid = "#F39C12", high = "#E74C3C",
    midpoint = 0.5,
    name = "Commuter Score"
  ) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma) +
  annotate("text", x = 500, y = 850, label = "All-Day Stations (Weekday = Weekend)", 
           hjust = 0.5, vjust = -0.5, fontface = "bold", color = "gray30") +
  labs(
    title = "Regional Rail Station Role: Weekday vs. Weekend Activity (2023)",
    subtitle = "Diagonal Line = Balanced Use | Above Line = Weekend-Oriented | Below = Commuter-Oriented",
    x = "Weekday Total Activity (Boards + Leaves) [Log Scale]",
    y = "Weekend Total Activity (Saturday) [Log Scale]",
    caption = "Data: SEPTA Regional Rail Station Summary | Stations with 100+ Total Activity"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

station_balance_plot

Top Regional Rail Stations

Code
# Top 20 regional rail stations by weekday activity.
top_rr_stations <- rr_station_balance %>%
  arrange(desc(Weekday)) %>%
  head(20)

top_rr_stations %>%
  select(Station, Line, Weekday, Weekend_Activity, Commuter_Score, Station_Type) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Station Line Weekday Weekend_Activity Commuter_Score Station_Type
Suburban Station Trunk 23880 3960 0.72 Strong Commuter
30th Street Station Trunk 10074 3320 0.50 Mixed Use
Jefferson Trunk 9936 5238 0.31 All-Day/Leisure
Penn Medicine Media/Wawa 1856 430 0.62 Mixed Use
Penn Medicine Airport 1456 256 0.70 Strong Commuter
Fox Chase Fox Chase 1390 164 0.79 Strong Commuter
Trenton Transit Center Trenton 1346 1613 -0.09 All-Day/Leisure
Penn Medicine Warminster 1262 155 0.78 Strong Commuter
Warminster Warminster 1229 277 0.63 Mixed Use
Bryn Mawr Paoli/Thorndale 1217 520 0.40 Mixed Use
Cornwells Heights Trenton 1146 129 0.80 Strong Commuter
Lansdale Lansdale/Doylestown 1131 652 0.27 All-Day/Leisure
Wynnewood Paoli/Thorndale 1118 326 0.55 Mixed Use
Jenkintown-Wyncote Warminster 1094 419 0.45 Mixed Use
Overbrook Paoli/Thorndale 1083 460 0.40 Mixed Use
Torresdale Trenton 1072 160 0.74 Strong Commuter
Wilmington Warminster 1061 619 0.26 All-Day/Leisure
North Wales Lansdale/Doylestown 1050 343 0.51 Mixed Use
Paoli Paoli/Thorndale 1034 389 0.45 Mixed Use
Ardmore Paoli/Thorndale 1013 586 0.27 All-Day/Leisure

Regional Rail Spatial Distribution

Code
# Spatial map of regional rail stations.
rr_spatial_plot <- rr_station_balance %>%
  ggplot() +
  geom_sf(data = tract_map_data, fill = NA, color = "gray60", linewidth = 0.25) +
  geom_point(
    aes(x = Longitude, y = Latitude, size = Weekday, color = Station_Type),
    alpha = 0.7
    ) +
  scale_size_continuous(range = c(2, 10), labels = comma, name = "Weekday Activity") +
  scale_color_manual(
    values = c("Strong Commuter" = "#E74C3C", "Mixed Use" = "#F39C12", "All-Day/Leisure" = "#27AE60"),
    name = "Station Type"
    ) +
  labs(
    title = "Regional Rail Station Distribution and Activity (2023)",
    subtitle = "Point size indicates weekday activity level | Color shows commuter orientation.",
    x = "Longitude",
    y = "Latitude",
    caption = "Data: SEPTA Regional Rail Station Summary"
    ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right",
    panel.grid = element_line(color = "gray90")
    ) +
  coord_sf(
    xlim = c(min_lon, max_lon),
    ylim = c(min_lat, max_lat)
    )

rr_spatial_plot

Bus-Rail Interactions

  • Feeder Routes: Bus routes connecting to major rail stations.

  • Competitive Corridors: Parallel bus and rail service along the same corridors.

  • Modal Shift: Changes in bus ridership near rail stations during recovery.

Note: Full analysis requires spatial join of tract centroids to rail station buffers.
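A minimal sketch of that spatial join, assuming `tract_map_data` is the tract sf object used in the maps above, and reprojecting everything to EPSG:2272 (Pennsylvania South state plane, in feet) so buffer distances are meaningful:

Code
# Sketch: flag tracts whose centroid falls within 0.5 mile of a rail station.
rr_sf <- rr_station_balance %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform(2272)

rr_buffers <- st_buffer(rr_sf, dist = 2640)  # 0.5 mile in feet.

tract_centroids <- tract_map_data %>%
  st_transform(2272) %>%
  st_centroid()

# Left join keeps all tracts; near_rail is FALSE when no buffer matches.
tracts_near_rail <- st_join(tract_centroids, rr_buffers, left = TRUE) %>%
  mutate(near_rail = !is.na(Station))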

Potential Distance to Rail Station Feature?

Tracts farther from rail are more bus-dependent, so we would expect a positive coefficient on distance: the greater the distance from rail, the higher the bus ridership. This could identify underserved areas for new rail or BRT investment!
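The feature itself could be computed like this, assuming station points (`rr_sf`) and tract centroids (`tract_centroids`) are sf objects sharing a projected CRS measured in feet (names hypothetical):

Code
# Sketch: distance from each tract centroid to its nearest rail station.
nearest_idx <- st_nearest_feature(tract_centroids, rr_sf)

tract_centroids <- tract_centroids %>%
  mutate(
    dist_to_rail_mi = as.numeric(
      st_distance(tract_centroids, rr_sf[nearest_idx, ], by_element = TRUE)
    ) / 5280  # Feet to miles.
  )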

Stop-Route Density Feature?

Stops per square mile by tract could be an option: count stops per tract and normalize by tract area.

A route diversity feature could be created too: the number of unique routes serving each tract. We could build a proper diversity index instead, which might be better, but I don’t think it’s in the assignment toolbox.

Tracts with more stops/routes likely have higher ridership, so service expansion should prioritize low-density areas.
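Both features could be sketched as below, assuming a stop-level sf object `stops_sf` with a `Route` column, a `GEOID` tract identifier on `tract_map_data`, and geometries in a foot-based projected CRS like EPSG:2272 (names hypothetical):

Code
# Sketch: stop density and route diversity per tract.
tract_service <- st_join(stops_sf, tract_map_data) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarize(
    n_stops = n(),
    route_diversity = n_distinct(Route),
    .groups = "drop"
  )

tract_service <- tract_map_data %>%
  mutate(area_sqmi = as.numeric(st_area(geometry)) / 27878400) %>%  # Sq ft to sq mi.
  st_drop_geometry() %>%
  select(GEOID, area_sqmi) %>%
  left_join(tract_service, by = "GEOID") %>%
  mutate(stops_per_sqmi = n_stops / area_sqmi)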

More Insights

Key Stuff from Contextual SEPTA Data

Stop-Level Patterns

  • Top 10% of stops account for 59.7% of boardings.
  • Mean weekday dominance: 52.0%.
  • Spatial hotspots concentrated in Center City and transit hubs.

Route-Level Recovery

  • Median recovery ratio: 0.72.
  • Routes with full recovery: 1.526718.
  • High variation suggests route-specific interventions needed.

Modal Dynamics

  • Bus-Rail correlation: 0.969.
  • Bus recovered faster than Regional Rail post-pandemic.
  • Modal share shift toward bus indicates changing commute patterns.

Bus-Rail Integration

  • Strong commuter stations may compete with parallel bus routes.
  • Mixed-use stations create feeder bus opportunities.
  • Distance to rail is key predictor for bus dependency.

Implications for Tract-Level Model

  • Stop density metric enhances tract-level service quality proxy.
  • Route diversity predicts resilience to service disruptions.
  • Rail proximity captures modal competition effects.
  • Recovery varies by route, suggesting heterogeneous treatment effects.

Data Integration?

Spatial Joins

  • Join stops to census tracts using st_join().
  • Calculate stops_per_tract and route_diversity_index.
  • Compute distance_to_nearest_rail_station.

Aggregate Metrics

  • Sum weekday boardings by tract from stop data.
  • Validate against Bus_Ridership_by_Census_Tract.csv.
  • Check correlation between stop-level and tract-level totals.
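The validation step might look like this, assuming the stop-derived totals live in `tract_stop_totals` with a `GEOID` matching the panel file and a `Weekday_Boardings`-style column (all names hypothetical):

Code
# Sketch: compare stop-derived totals against the tract-level file.
validation <- tract_stop_totals %>%
  inner_join(df_raw, by = "GEOID")

cor(validation$stop_weekday_boardings, validation$Weekday_Boardings,
    use = "complete.obs")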

Temporal Features from Routes

  • Identify routes serving each tract (spatial intersection).
  • Calculate weighted average route recovery ratio per tract.
  • Create tract-level route stability index.

Modal Competition Variables

  • Flag tracts within 0.5 mile of high-volume rail stations.
  • Calculate modal substitution index (bus vs. rail trends).
  • Interaction term: rail_proximity x post_pandemic.

Model Enhancement

  • Add stop density as service quality predictor.
  • Include route diversity as resilience metric.
  • Test rail distance coefficient significance.
  • Estimate heterogeneous effects by rail accessibility.
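Pulling these pieces together, the planned specification could be sketched as below, assuming a tract-year panel `panel_df` with the engineered features already joined in (all variable names hypothetical):

Code
# Sketch: Negative Binomial panel model with temporal lag, spatial lag,
# service features, rail proximity, and tract/year fixed effects.
library(MASS)

nb_model <- glm.nb(
  ridership ~ lag_ridership + spatial_lag_ridership +
    stops_per_sqmi + route_diversity +
    dist_to_rail_mi * post_pandemic +  # Main effects plus interaction.
    factor(GEOID) + factor(Year),
  data = panel_df
)

summary(nb_model)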